Connect to the RSelenium server from within R to automate the web browser.
#Download binaries, start the driver, and get the client object.
library(RSelenium)
rd <- rsDriver(browser = "firefox")
ffd <- rd$client
#Navigate to the web page
ffd$navigate("https://www.imdb.com/title/tt6751668/reviews?ref_=tt_ql_3")
#Find the load-more button, then send a click event.
load_btn <- ffd$findElement(using = "css selector",".load-more-data")
load_btn$clickElement()
#Wait for elements to load
Sys.sleep(2)
# Repeat the step several times
load_btn$clickElement()
Sys.sleep(2)
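The click-then-sleep pair can be wrapped in a small helper so the number of repeats is easy to change. This is a hedged sketch (the helper name `repeat_with_pause` is my own, not part of RSelenium); pass it a closure wrapping `load_btn$clickElement()`:

```r
# Sketch: repeat any zero-argument action a fixed number of times,
# pausing between attempts (e.g. to let more reviews load).
repeat_with_pause <- function(action, times = 5, wait = 2) {
  for (i in seq_len(times)) {
    action()
    Sys.sleep(wait)
  }
}
# Hypothetical usage with the Selenium button found above:
# repeat_with_pause(function() load_btn$clickElement(), times = 5)
```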
#Get the HTML source and parse it with rvest
library(rvest)
Parasite <- ffd$getPageSource()[[1]]
Reviews <- Parasite %>%
read_html() %>%
html_nodes(".text.show-more__control") %>%
html_text()
Date <- Parasite %>%
read_html() %>%
html_nodes(".review-date") %>%
html_text()
ParasiteDF <- data.frame(Reviews, Date, stringsAsFactors = FALSE)
For convenience, I save this data frame to a csv and read it back in.
Remove the invalid UTF-8 characters.
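One way to do this, sketched with base R's `iconv` (re-encoding UTF-8 to UTF-8 and substituting bytes that fail to convert with the empty string); the toy string below is an assumption, not the real data:

```r
# Sketch: strip bytes that are not valid UTF-8 from each review.
# "abc\xfedef" is a toy string containing the invalid byte 0xFE.
clean_utf8 <- function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
cleaned <- clean_utf8("abc\xfedef")
```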
The term frequency table counts the frequency of each distinct word in the reviews of Parasite, excluding duplicate and uninformative words. Displaying the top 30 terms with the highest frequency suggests a positive emotion among the reviews: words such as best, good, and great all reveal an affirmative attitude toward this movie.
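As a minimal illustration of what the tm pipeline below computes, a term frequency table can be built with base R alone (toy reviews, not the real data):

```r
# Toy sketch of a term frequency table: lowercase, strip punctuation,
# split on whitespace, then tabulate word counts.
reviews <- c("Best movie of the year", "A good, good story")
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", reviews)), "\\s+"))
freq <- sort(table(words), decreasing = TRUE)
```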
library(tm)
Review <- Corpus(VectorSource(ParasiteDF$Reviews))
# Convert the text to lower case
Review <- tm_map(Review, content_transformer(tolower))
# Remove numbers
Review <- tm_map(Review, removeNumbers)
# Remove common English stopwords
Review <- tm_map(Review, removeWords, stopwords("english"))
# Remove punctuation
Review <- tm_map(Review, removePunctuation)
# Eliminate extra whitespace
Review <- tm_map(Review, stripWhitespace)
dtm <- TermDocumentMatrix(Review)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# Remove the most common words such as film and parasite
d$word <- as.character(d$word)
d$word <- tm::removeWords(d$word, words = c("parasite","film","one","can","make","movie","many","see","every","two","will","just",
"movies","watch","really","makes","get"))
d<- d %>%
na_if("") %>%
na.omit
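Dropping these custom stopwords can also be done without `removeWords` and the `na_if`/`na.omit` step, by filtering out rows whose word matches the drop list exactly (a sketch on a toy data frame, not the real counts):

```r
# Sketch: remove rows whose word is in a custom drop list.
d_demo <- data.frame(word = c("parasite", "best", "film", "great"),
                     freq = c(40, 12, 30, 9),
                     stringsAsFactors = FALSE)
drop_words <- c("parasite", "film")
d_demo <- d_demo[!d_demo$word %in% drop_words, ]
```

This avoids creating empty strings that then have to be converted to NA and omitted.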
head(d, 30)
Word cloud
set.seed(1001)
wordcloud2::wordcloud2(data=d, size=1, color='random-dark', minRotation = pi/6, maxRotation = pi/6, rotateRatio = 1)
Data Cleaning
ParasiteDF <- ParasiteDF %>%
dplyr::mutate(Reviews = as.character(Reviews),
Reviews=stringr::str_replace_all(Reviews, "[[:punct:]]", ""),
Reviews=stringr::str_replace_all(Reviews, "[[:digit:]]", ""),
Reviews=tolower(Reviews),
Reviews=gsub("\\r\\n","",Reviews),
Reviews=removeWords(Reviews, words = stopwords("en")),
Reviews=gsub("([a-z])([A-Z])", "\\1 \\2", Reviews))
head(ParasiteDF)
By aggregating all the reviews of a specific date together, I used the sentiment function to get an overall sentiment score for each day. The table below shows the date, the number of words in the reviews on that date, and the sentiment score of those reviews. The proportion of negative reviews on a daily basis is 14.16%, while positive reviews have a much higher share at 82.30%. This further supports the finding of positive emotion in the frequency table. It would be sloppy to conclude that the majority of reviews are positive by just looking at a few comments on the first page of IMDB, which is why I scraped all the reviews of Parasite from IMDB to conduct this sentiment analysis.
ParasiteDF1<- ParasiteDF%>%
dplyr::mutate(Date =lubridate::dmy(Date))%>%
dplyr::group_by(Date)%>%
dplyr::summarise(Reviews = paste0(Reviews,collapse = " "),
Reviews = str_squish(Reviews))%>%
dplyr::arrange(desc(Date))
library(sentimentr)
rawsentiment <- sentiment(ParasiteDF1$Reviews, polarity_dt = lexicon::hash_sentiment_nrc)
rawsentiment <- cbind(rawsentiment,ParasiteDF1$Date)%>%
select(V2,word_count,sentiment)
colnames(rawsentiment)[1] = "Date"
rawsentiment
Negative and Positive Proportion
pos <- nrow(rawsentiment[sentiment > 0,])
neg <- nrow(rawsentiment[sentiment < 0,])
negperc <- neg/nrow(rawsentiment)
posperc <- pos/nrow(rawsentiment)
negperc
## [1] 0.1415929
posperc
## [1] 0.8230088
The primary reason for aggregating reviews on a daily basis is to discover a possible relationship between review sentiment and time. The graph illustrates that the sentiment of reviews stays almost constant, with a slight increase in score recently.
rawsentiment%>%
ggplot()+
geom_point(mapping=aes(x=Date,
y=sentiment),color='cornflowerblue')+
geom_smooth(mapping=aes(x=Date,
y=sentiment),color='pink3')+
labs(title="Review Sentiment Over Time",
subtitle="From 2019 to 2020")+
theme_classic()+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(face ="bold",size=18))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Parasite Cover
library(image.darknet)
f <- system.file(package="image.darknet", "include", "darknet", "data", "imagenet.shortnames.list")
labels <- readLines(f)
weights <- file.path(system.file(package="image.darknet", "models"), "darknet19.weights")
darknet_19 <- image_darknet_model(type = 'classify',
model = "darknet19.cfg", weights = weights, labels = labels)
x <- image_darknet_classify(file = "/Users/yuangao/Documents/Unstructured/hw3/HW3/para.jpg",
object = darknet_19)
x
## $file
## [1] "/Users/yuangao/Documents/Unstructured/hw3/HW3/para.jpg"
##
## $type
## label probability
## 1 comic book 0.23770124
## 2 jean 0.06936955
## 3 web site 0.05744276
## 4 maillot 0.05093752
## 5 jersey 0.03305830
Topic models are great for predicting topic probabilities for unseen documents, and in this case it is unrealistic for us to manually separate a large number of words into different topics. Therefore, a topic model can help us cluster the reviews into an appropriate number of topics and find how each topic's share changed over time.
Since the prepDocuments function only removes infrequent words, and we would like to avoid the situation where every topic includes common words such as movie, some over-frequent words are removed from the reviews first.
stopwords = c("parasite","film","one","can","make","movie","many","see","every","two","will","just",
"movies","watch","really","makes","get","family")
Re = ParasiteDF1$Reviews
Re = removeWords(Re,stopwords)
ParasiteDF1$Reviews <- Re
ParasiteDF1 <- ParasiteDF1 %>%
dplyr::mutate(Year=lubridate::year(Date))
Using the utility function prepDocuments to process the loaded data to make sure it is in the right format.
set.seed(1001)
# partition of Data
holdoutRows = sample(1:nrow(ParasiteDF1), 50, replace = FALSE)
#ParasiteDF1$Date = as.numeric(ParasiteDF1$Date)
library(stm)
prosText = textProcessor(documents = ParasiteDF1$Reviews[-c(holdoutRows)],
metadata = ParasiteDF1[-c(holdoutRows), ],
stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
textPrep = prepDocuments(documents = prosText$documents,
vocab = prosText$vocab,
meta = prosText$meta)
## Removing 7319 of 12558 terms (7319 of 53206 tokens) due to frequency
## Removing 1 Documents with No Words
## Your corpus now has 175 documents, 5239 terms and 45887 tokens.
The 4 plots below help us find the optimal number of topics: we should select the number of topics with relatively high semantic coherence and low residuals. The model with 10 topics is picked here. It has the highest semantic coherence and the lowest residuals among all the numbers of topics I tested. The held-out likelihood, which indicates how well the model predicts unseen words, is also reasonable with 10 topics.
kTest = searchK(documents = textPrep$documents, vocab = textPrep$vocab,
K=c(3, 5, 7, 10, 15, 20), verbose=FALSE)
plot(kTest)
From this graph, we can tell that topic 8 has the largest proportion, followed by topic 3…
topics10 = stm(documents = textPrep$documents,
vocab = textPrep$vocab, seed = 1001,
K = 10, verbose = FALSE)
plot(topics10)
The expected topic proportions are clear from the graph above, but the three highest-frequency words in each topic cannot tell a complete story of what each topic is about. Therefore, labelTopics gives us more information by showing, for each topic, the words with the highest probability, FREX (frequent and exclusive relative to other topics), lift, and score. From the FREX words of each topic, we can assign a label to each topic.
Topic 1: Humanity. Topic 2: Scene. Topic 3: Director. Topic 4: Social issue (polarization). Topic 5: Theatre & Viewer. Topic 6: Academy. Topic 7: Other films. Topic 8: Actors. Topic 9: Storyline. Topic 10: Expectation.
labelTopics(topics10)
## Topic 1 Top Words:
## Highest Prob: well, story, class, bong, kim, park, best
## FREX: spent, human, taek, deeply, looks, body, ground
## Lift: desires, gifted, spent, superiority, transformation, humiliation, taek
## Score: spent, kitaeks, chose, disease, taek, slave, body
## Topic 2 Top Words:
## Highest Prob: rich, poor, like, class, good, way, dont
## FREX: rain, window, smell, underground, murder, play, birthday
## Lift: audiencethe, casts, disparate, doted, ducked, faced, fridge
## Score: tent, rain, dude, traditional, disparate, fridge, inferiority
## Topic 3 Top Words:
## Highest Prob: bong, think, best, poor, rich, like, great
## FREX: think, planning, themes, review, layers, original, want
## Lift: blink, comedydrama, excellently, junho, obtaining, undertones, welloverall
## Score: junho, planning, masters, organism, jokes, kitaeks, paper
## Topic 4 Top Words:
## Highest Prob: story, poor, end, rich, people, like, good
## FREX: played, festival, apartment, end, yeonkyo, humor, evil
## Lift: blur, casualties, composition, dysfunctional, elegance, empathize, focusing
## Score: heaven, clues, views, blur, sydney, roll, casualties
## Topic 5 Top Words:
## Highest Prob: like, story, still, time, people, feel, dont
## FREX: theatre, sat, still, dread, truly, watching, level
## Lift: utter, adaptation, amazement, attend, attended, betrayed, bricks
## Score: eager, sat, utter, dread, amazement, fullblown, stance
## Topic 6 Top Words:
## Highest Prob: best, like, korean, story, good, films, way
## FREX: academy, picture, surely, set, nominees, mean, oscars
## Lift: accurate, believability, blank, competitive, conartist, downright, lucrative
## Score: plain, nominees, multilayered, interior, instance, architecture, stage
## Topic 7 Top Words:
## Highest Prob: best, like, good, poor, great, well, rich
## FREX: oscars, koreans, suspenseful, parts, joker, oscar, picture
## Lift: gala, movieit, slightest, anchor, bravo, carried, confess
## Score: movieit, koreans, symbolizes, native, wellmade, gala, slightest
## Topic 8 Top Words:
## Highest Prob: good, like, best, people, story, way, poor
## FREX: writing, husband, creativity, con, opinion, enjoyed, dramatic
## Lift: hitchcockian, restrained, bubble, dilemma, february, finest, goodthe
## Score: shaped, finest, creativity, slip, sub, hitchcockian, superficial
## Topic 9 Top Words:
## Highest Prob: park, best, korean, bong, story, kim, like
## FREX: mrs, geunsae, kijeong, park, minhyuk, jang, song
## Lift: absorbing, advantages, amenities, architects, artistry, backfires, bled
## Score: geunsae, mrs, taught, minhyuk, kijeong, replace, convinces
## Topic 10 Top Words:
## Highest Prob: like, bong, well, class, kims, park, parks
## FREX: expecting, kims, parks, inequality, part, worth, kiwoo
## Lift: proven, allthe, captivated, controlled, epilogue, esp, expecting
## Score: expecting, relation, popular, relatively, thematic, proven, graphic
Repeat the text processing procedures on test data.
newText = textProcessor(documents = ParasiteDF1$Reviews[holdoutRows],
metadata = ParasiteDF1[holdoutRows, ],
stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
Align the old vocabulary from the 10-topic model to our holdout reviews.
newCorp = alignCorpus(new = newText, old.vocab = topics10$vocab)
## Your new corpus now has 50 documents, 3301 non-zero terms of 5996 total terms in the original set.
## 2695 terms from the new data did not match.
## This means the new data contained 63.0% of the old terms
## and the old data contained 55.1% of the unique terms in the new data.
## You have retained 18843 tokens of the 21880 tokens you started with (86.1%).
Predicting thetas for the holdout documents based on the previously fitted model.
newFitted = fitNewDocuments(model = topics10, documents = newCorp$documents,
newData = newCorp$meta, origData = textPrep$meta)
## ..................................................
Estimation with topical prevalence parameter Year.
topicPredictor = stm(documents = textPrep$documents,
vocab = textPrep$vocab, prevalence = ~ Year,
data = textPrep$meta, K = 10, verbose = FALSE)
timeEffect = estimateEffect(1:10 ~ Year, stmobj = topicPredictor,
metadata = textPrep$meta)
#ratingEffect$Date <- as.numeric(ratingEffect$Date)
summary(timeEffect, topics = c(1:10))
##
## Call:
## estimateEffect(formula = 1:10 ~ Year, stmobj = topicPredictor,
## metadata = textPrep$meta)
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.82424 94.39244 1.386 0.168
## Year -0.06474 0.04675 -1.385 0.168
##
##
## Topic 2:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.28872 88.81219 1.017 0.311
## Year -0.04468 0.04398 -1.016 0.311
##
##
## Topic 3:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.96617 107.46722 0.242 0.809
## Year -0.01280 0.05322 -0.240 0.810
##
##
## Topic 4:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.448726 106.963797 0.116 0.907
## Year -0.006105 0.052971 -0.115 0.908
##
##
## Topic 5:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -145.37812 99.75809 -1.457 0.147
## Year 0.07204 0.04940 1.458 0.147
##
##
## Topic 6:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.84106 90.73958 -0.119 0.905
## Year 0.00541 0.04494 0.120 0.904
##
##
## Topic 7:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.751858 108.276008 -0.173 0.863
## Year 0.009342 0.053622 0.174 0.862
##
##
## Topic 8:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.04295 111.15252 0.279 0.780
## Year -0.01531 0.05505 -0.278 0.781
##
##
## Topic 9:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.83015 95.19982 0.376 0.707
## Year -0.01770 0.04715 -0.375 0.708
##
##
## Topic 10:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -149.00514 84.20059 -1.77 0.0785 .
## Year 0.07383 0.04170 1.77 0.0784 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The proportion of words in Topic 1: Humanity decreases over time.
The proportion of words in Topic 2: Scene decreases over time.
The proportion of words in Topic 3: Director remains constant over time.
The proportion of words in Topic 4: Social issue (polarization) remains constant over time.
The proportion of words in Topic 5: Theatre & Viewer increases over time.
The proportion of words in Topic 6: Academy remains constant over time.
The proportion of words in Topic 7: Other films slightly increases over time.
The proportion of words in Topic 8: Actors slightly decreases over time.
The proportion of words in Topic 9: Storyline slightly decreases over time.
The proportion of words in Topic 10: Expectation increases over time.
Note, however, that none of the Year coefficients reach significance at the 5% level (only Topic 10 is marginal, p ≈ 0.078), so these are weak trends at best.
From the trend line of the proportion of words in each topic over time, we can tell that there were more reviews about the storyline, the actors, and the scenes just after the movie was released. However, as time went by, there were fewer comments on the content of the movie and more reviews about expectations and the director.
for (i in 1:10) {
plot.estimateEffect(timeEffect, "Year", method = "continuous",
model = topicPredictor, topics = i, labeltype = "frex")
}
A work by Karen